Matching Information Items

ABSTRACT

In one embodiment, a method of identifying the presence of matching information items in a network includes using a hashing scheme to generate a set of first hash values from a respective set of first information items stored at a first node and transmitting the set of first hash values over the network to a second node. The set of first hash values is compared at the second node with a set of second hash values generated, using the hashing scheme, from a respective set of second information items stored in the network, to thereby determine at least one matching hash value between the set of first hash values and the set of second hash values. The determined matching hash value is used to identify the presence of at least one matching information item between the set of first information items and the set of second information items. The hashing scheme is chosen so that a unique hash value in the hashing scheme indicates a sufficient number of information items to prevent the unique hash value being used as an identifier of a unique information item, such that the transmission of the set of first hash values to the second node does not disclose the set of first information items to the second node.

RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 12/640,844 entitled Matching Information Items filed Dec. 17, 2009, which claims priority under 35 U.S.C. §119 or 365 to Great Britain, Application No. 0919675.9, filed Nov. 10, 2009. The entire teachings of the above applications are incorporated herein by reference.

BACKGROUND

A network typically comprises a plurality of nodes which can communicate with each other, such that each node in the network is capable of communicating with at least one other node in the network. The network may be for example the internet, but other networks may be used alternatively or additionally. Nodes can communicate in the network using links that may be direct links between the nodes, or alternatively the links may be indirect links through the network, such that the nodes communicate with each other via at least one other node in the network.

The network may employ a packet-based communication system. Packet-based communication systems allow the user of a device, such as a personal computer, to communicate across the network. One type of packet-based communication system uses a peer-to-peer (“P2P”) topology built on proprietary protocols. To enable access to a peer-to-peer system, the user must execute P2P client software provided by a P2P software provider on their computer, and register with the P2P system. When the user registers with the P2P system the client software is provided with a digital certificate from a server. Once the client software has been provided with the certificate, communication can subsequently be set up and routed between users of the P2P system without the further use of a server. In particular, the users can establish their own communication routes through the P2P system based on the exchange of one or more digital certificates (or user identity certificates, “UIC”), which enable access to the P2P system. The exchange of the digital certificates between users provides proof of the users' identities and that they are suitably authorised and authenticated in the P2P system to engage in communication. Further details on such a P2P system are disclosed in WO 2005/009019.

A first node in the network may store details of contacts of a first user of the first node. The first node, or the first user, may be interested in determining whether any of the first user's contacts are also contacts of a second user at a second node in the network. One method for determining if there are any common contacts between the first and second users is for the first node to send a list of the first user's contacts to the second node. The second node can then compare the first and second user's contacts to find common contacts and can return the results to the first node.

In another scenario, a social networking site in the network may be used to find users in the network who share common interests. The users can transmit identifiers of their interests to a node of the social networking site, so that the social networking site can compare the interests of different users to identify users with common interests. A recommendation may then be sent to at least one of the identified users to indicate that the identified users share common interests.

In a further scenario in which the network is a file sharing network, some nodes in the network are file sharing nodes, whereby files stored at the file sharing nodes can be retrieved by other nodes in the network. Each file sharing node sends identifiers, such as filenames, of files stored at the sharing node to a central index node. A user wanting to find a desired file in the file sharing network can send a request to the central index node identifying the desired file. The central index node can compare the request with the identifiers received from the sharing nodes to find any matches. If the central index node determines that the desired file is stored on a sharing node then the central index node can inform the user searching for the desired file of the location of the sharing node in the network. The user can then contact the sharing node to request the desired file.

The inventor has appreciated that there are two common problems with the separate scenarios described above. Firstly, the transmission of the contact details, the identifiers of interests or the identifiers of files over the network requires a large amount of data to be transmitted across the network. This reduces the network resources that are available for other purposes. Secondly, there may be a security or privacy problem in the scenarios described above. For example, a first user might not want to reveal his contacts to a second user. As another example, users might not want to reveal their interests to a social networking site. Furthermore, a sharing node may not want to reveal the files stored at the sharing node to a central index node and also a user searching for a desired file may not want to reveal the desired file to the central index node.

SUMMARY

In a first aspect of the invention there is provided a method of identifying the presence of matching information items in a network, the network comprising a first node and a second node, the method comprising: using a hashing scheme to generate a set of first hash values from a respective set of first information items stored at the first node; transmitting the set of first hash values over the network to the second node; comparing the set of first hash values at the second node with a set of second hash values generated, using the hashing scheme, from a respective set of second information items stored in the network, to thereby determine at least one matching hash value between the set of first hash values and the set of second hash values; using the determined at least one matching hash value to identify the presence of at least one matching information item between the set of first information items and the set of second information items, wherein the hashing scheme is chosen so that a unique hash value in the hashing scheme indicates a sufficient number of information items to prevent the unique hash value being used as an identifier of a unique information item, such that the transmission of the set of first hash values to the second node does not disclose the set of first information items to the second node.

In a second aspect of the invention there is provided a network comprising: a first node comprising means for using a hashing scheme to generate a set of first hash values from a respective set of first information items stored at the first node; and a second node comprising: means for receiving the set of first hash values over the network; means for comparing the set of first hash values with a set of second hash values generated, using the hashing scheme, from a respective set of second information items stored in the network, to thereby determine at least one matching hash value between the set of first hash values and the set of second hash values, wherein the determined at least one matching hash value is used to identify the presence of at least one matching information item between the set of first information items and the set of second information items, and wherein the hashing scheme is chosen so that a unique hash value in the hashing scheme indicates a sufficient number of information items to prevent the unique hash value being used as an identifier of a unique information item, such that the transmission of the set of first hash values to the second node does not disclose the set of first information items to the second node.

Embodiments of the present invention provide a method of identifying commonly held information items in a peer to peer system without disclosing the held information items to another peer in the network. This is achieved by comparing hash values generated from the information items while ensuring that the number of possible hash values in the hashing scheme is smaller, and preferably significantly smaller, than the total number of information items in the peer to peer system. The hash values thus collide heavily across the whole set of information items in the system. This ensures that a 1:1 correspondence of hash values and information items cannot be established, such that disclosing a hash value to a node does not disclose the information item used to generate the hash value to the node. In this way the invention allows commonly held information items to be identified without unnecessarily transmitting the information items over the network (and thereby unnecessarily using network resources) to a node, and without disclosing, or uniquely identifying, the information items to the node.

Hashing schemes are known which may be used to generate hash values from information items, such that the hash values can be compared rather than the information items in order to find matching information items. Hashing is a technique for compacting information to identifiers (hash values), in such a way that both the content of the information items and the order of the information items in the list of information items are taken into account in generating hash values from a list of information items. Typically, hash values are smaller than the information items from which they are generated. The situation in which a single hash value corresponds to more than one information item is called a “hash collision”. Where a hashing scheme is used in which there are no hash collisions, the identification of matching hash values equates to an identification of matching information items identified by the hash values.

Using hash values rather than the information items themselves reduces the amount of data transmitted over the network and hides the content of the information items. Hash values may be generated using a one-way hashing algorithm. Hash functions are by definition non-reversible, meaning that the original content from which the hash was calculated cannot be recreated from the hash value, thereby protecting the original content. This assumes that the content is large enough to make a brute-force attack (which involves generating hash values of all possible variations of the content) infeasible. It can therefore be seen that using the hash values provides a level of security to protect against the disclosure of information items.

Since hash values are generally smaller than the information items from which they are generated, it is possible that more than one information item will generate the same hash value using a particular hash function, i.e. there is a hash collision. Typically, hash functions are carefully chosen to minimize the number of hash collisions in the network. By minimizing the number of hash collisions, hash values can be used for indexing large data items to save index space. Data items found by hash index may be further compared to eliminate unwanted records that were retrieved because of a hash collision. Since the number of hash collisions in the hashing scheme is minimized, the number of comparisons of the large data items that is required is minimized. Hash values may also be used to detect duplicates in large data sets where directly comparing data items to many other data items would be prohibitively complicated. The data items are first hashed, and then the resulting hash values are compared. In all of these systems, the hashing scheme is chosen to minimize, or if possible eliminate, the occurrences of hash collisions. The hashing scheme can be said to have ‘uniqueness’ if there are no hash collisions, and to have near uniqueness if there are a small number of hash collisions.

In some applications hash functions are chosen in such a way that similar information items produce hash values that are also similar. This property can be used to facilitate data distribution into separate ‘bins’ while still preserving a property of hash uniqueness.

In some systems, such as a P2P communication system as described above, privacy concerns make it undesirable to reveal information held at one node to another node in such a way that the information can be seen, or even uniquely identified by the other node. For example, in a P2P system, although the presentation of digital certificates provides sufficient trust in the identity of a user for communication to be established with that user across the network, the digital certificates might not provide sufficient trust for disclosing information items to that user. Although using a hashing scheme provides a degree of privacy for the information items, the usually desirable uniqueness property (or near uniqueness) of hash functions makes identification of information items that generated the hash values possible by creating a pre-calculated mapping between information items and their hash values. This is known as a “dictionary attack” in which all of the possible hash values are pre-computed and stored in a “dictionary” in association with the corresponding information items from which they are generated. If a hash value is known, the information item that generates the hash value can then be determined using the “dictionary”.
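By way of illustration only, the dictionary attack described above can be sketched as follows. The sketch assumes SHA-256 (from Python's standard hashlib module) as a stand-in for a hashing scheme with uniqueness or near uniqueness, and a hypothetical list of usernames that an attacker is able to enumerate; neither the choice of hash function nor the names near_unique_hash and build_dictionary are taken from the described embodiments.

```python
import hashlib

def near_unique_hash(item: str) -> str:
    """SHA-256 digest, used here as an assumed example of a near-unique hash."""
    return hashlib.sha256(item.encode("utf-8")).hexdigest()

def build_dictionary(candidates):
    """Pre-compute a mapping of hash value -> information item."""
    return {near_unique_hash(item): item for item in candidates}

# Hypothetical set of usernames the attacker can enumerate in advance.
known_usernames = ["Amy", "Rosie", "Martyn", "Madis", "Sonia", "Michael"]
dictionary = build_dictionary(known_usernames)

# A hash value disclosed by another node is reversed with a single lookup,
# because each hash value maps to exactly one candidate item.
disclosed = near_unique_hash("Rosie")
recovered = dictionary.get(disclosed)
```

Because each entry in the dictionary maps one hash value to one item, a disclosed hash value identifies the item directly. This is the property that a heavily colliding hashing scheme is intended to remove.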

However, in embodiments of the present invention, the hashing scheme is intentionally chosen such that it generates hash values that collide so frequently that a hash value cannot be reliably used as a unique identifier of an information item. This is contrary to hashing practice in most prior applications in which the hashing scheme is chosen to minimize the occurrence of hash collisions.

Embodiments of the present invention are particularly useful in systems where different information items are held at different nodes while the nodes or users of the nodes involved do not trust each other fully. Such systems can be P2P networks for use in file sharing or instant messaging. In one embodiment, the method can be used to find out potentially interesting information, such as commonly shared contacts between users in a P2P instant messaging system. In another embodiment, the method can be used to locate potential sources of information items without revealing exactly what information is being sought, such as locating a file in a P2P file sharing system. In a further embodiment, the method can be used to identify users in the network with common interests without revealing the interests themselves.

BRIEF DESCRIPTION OF THE DRAWINGS

For a better understanding of the present invention and to show how the same may be put into effect, reference will now be made, by way of example, to the following drawings in which:

FIG. 1 is a schematic diagram of a communications system according to a preferred embodiment;

FIG. 2 shows a user interface according to a preferred embodiment;

FIG. 3 represents the network of a first embodiment;

FIG. 4 shows a flowchart of a process for determining common contacts in the first embodiment;

FIG. 5 represents the network of a second embodiment;

FIG. 6 shows a flowchart of a process for locating files in the second embodiment;

FIG. 7 represents the network of a third embodiment; and

FIG. 8 shows a flowchart of a process for identifying users with common interests in the third embodiment.

DETAILED DESCRIPTION

In embodiments of the invention a P2P communication system is operated on a network. FIG. 1 illustrates a packet-based P2P communication system 100. A first user of the communication system (User A 112) operates a user terminal 102, which is shown connected to the rest of the network 111. The user terminal 102 may be, for example, a mobile phone, a personal digital assistant (“PDA”), a personal computer (“PC”) (including, for example, Windows™, Mac OS™ and Linux™ PCs), a gaming device or other embedded device able to connect to the network 111. The network 111 may be, for example, the internet. The user device 102 is arranged to receive information from and output information to a user 112 of the device. In a preferred embodiment the user device 102 comprises a display such as a screen and an input device such as a keypad, joystick, touch-screen, keyboard and/or mouse. The user device 102 is connected to the network 111 via link 106.

Note that in alternative embodiments, the user terminal 102 can connect to the communication network 111 via additional intermediate networks not shown in FIG. 1. For example, if the user terminal 102 is a mobile device, then it can connect to the communication network 111 via a cellular mobile network (not shown in FIG. 1), for example a GSM or UMTS network.

The user terminal 102 is running a communication client 116, provided by the software provider. The communication client 116 is a software program executed on a local processor in the user terminal 102.

An example of a user interface 200 of the communication client 116 executed on the user terminal 102 of the first user 112 is illustrated in FIG. 2. Note that the user interface 200 can be different depending on the type of user terminal 102. For example, the user interface can be smaller or display information differently on a mobile device, due to the small screen size. In the example of FIG. 2, the client user interface 200 displays the username 202 of “User_A” 112 in the communication system. The client user interface 200 comprises a tab 204 labelled “contacts”, and when this tab is selected the contacts of User A in the P2P communication system are displayed in a pane 206 below the tab 204. In the example user interface in FIG. 2, four contacts of other users of the communication system are shown listed in pane 206, those being contacts with usernames “User B”, “Amy”, “Rosie” and “Martyn” as shown in FIG. 2. Each of these contacts has authorised User A 112 of the client 116 to view their contact details.

Returning to FIG. 1, node 102 comprises a store 108 for holding information or data. The information is typically in the form of discrete information items. The information items may be for example details of the contacts of the user 112 or identifiers of files stored at the node 102 or identifiers of interests of the user 112. The link 106 connects the user device 102 to a second user device 104 over the network 111. The link 106 may be a direct connection between nodes 102 and 104, or alternatively the link 106 may be an indirect connection between nodes 102 and 104 via other nodes in the rest of the network 111. Node 104 is associated with User B 114 and comprises a communication client 118 and an information store 110 similar to those of node 102.

It may be desired to identify the presence of matching information items between information items stored in the stores 108 and 110. However, due to security and privacy issues, which can be particularly important in P2P systems, it may be undesirable to reveal information items to another node, or even to uniquely identify the information items to the other node.

By using a hashing scheme as described above, node 102 can generate a list of hash values using a hashing function from the information items in store 108 at node 102 and transmit the hash values to the node 104 via the link 106. Node 104 can generate hash values from the list of information items in store 110 at the node 104 using the same hashing function and compare those hash values with the hash values received from node 102 to find matching hash values. If the hashing scheme has no hash collisions, the identification of matching hash values equates to an identification of matching information items identified by the hash values.

Hash functions with uniqueness or near uniqueness allow for identification of information items that generated the hash values. For example, if node 102 transmits a list of hash values to node 104 indicating the information items in store 108, and the hash values are generated using a hashing scheme which has no colliding hash values, then if one of the hash values received at node 104 matches a hash value generated from an information item in store 110, then node 104 can conclude that node 102 has an information item matching the information item in store 110 which was used to generate the matching hash value. Furthermore, if node 104 is able to determine an information item that would generate the hash value then node 104 can conclude that that information item is stored at node 102. In hashing schemes which have near uniqueness, although it may not be possible to uniquely identify an information item from a matching hash value, the information item can be identified as being one of only a few possible information items.

The inventor has realised that matching hash values can be used to identify the presence of matching information items using a hashing scheme in which the number of unique hash values is less than the number of unique information items in the system. This means that the same hash value is generated from more than one unique information item in the hashing scheme. The number of unique information items indicated by one hash value in the hashing scheme can be chosen in dependence upon the context in which the hash values are used. For example, in one context where the information items are details of a user's contacts, each hash value may preferably indicate hundreds of information items, whereas in another context where the information items are identifiers of files stored at a sharing node, each hash value may preferably indicate fewer information items, for example approximately ten information items. In preferred embodiments, the number of unique hash values is at least an order of magnitude less than the number of unique information items in the system. In these preferred embodiments, one unique hash value is generated from at least 10 unique information items. As an example, the number of unique information items may be 300 million, and, depending on the context, a suitable hashing scheme may be a 16-bit CRC which has a total of 65536 unique hash values, such that each hash value indicates an average of approximately 4500 information items. Therefore by disclosing the hash value to another node in the network, the information item cannot be uniquely identified (in the example the information item may be one of approximately 4500 information items), so the other node is prevented from using the hash value as an identifier of a unique information item. The hash value can only be used by the other node to identify that the information item is one from a group of information items which generate the same hash value.
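A minimal sketch of such a heavily colliding hashing scheme is given below, assuming Python and the 16-bit CRC-CCITT function binascii.crc_hqx from its standard library. The description names a 16-bit CRC but does not specify a particular polynomial, so that choice, and the helper names colliding_hash and average_items_per_hash, are assumptions made for illustration. The sketch also reproduces the arithmetic of the example: 300,000,000 / 65,536 is about 4578, in line with the figure of approximately 4500 given above.

```python
import binascii

HASH_BITS = 16
NUM_HASH_VALUES = 2 ** HASH_BITS  # 65536 unique hash values

def colliding_hash(item: str) -> int:
    """16-bit CRC of the item (CRC-CCITT), giving a value in 0..65535."""
    return binascii.crc_hqx(item.encode("utf-8"), 0)

def average_items_per_hash(num_items: int,
                           num_hash_values: int = NUM_HASH_VALUES) -> float:
    """Average number of unique information items sharing one hash value."""
    return num_items / num_hash_values

# Example from the description: 300 million unique information items.
avg = average_items_per_hash(300_000_000)
```

The width of the hash value is the parameter that would be varied to suit the context: a narrower hash makes each value indicate more items, and a wider one makes each value indicate fewer.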

Three particularly useful embodiments will now be described. The first embodiment is described with reference to FIGS. 3 and 4, the second embodiment will be described with reference to FIGS. 5 and 6 and the third embodiment will be described with reference to FIGS. 7 and 8.

The first particularly useful embodiment relates to finding common contacts between users in a network. FIG. 3 shows the user nodes 102 and 104 being connected via the link 106. The store 108 on node 102 holds the usernames of contacts of User A 112. In the example shown in FIG. 3, the store 108 holds the usernames “Amy”, “Rosie” and “Martyn”. The store 110 on node 104 holds the usernames of contacts of User B 114. In the example shown in FIG. 3, store 110 holds the usernames “Madis”, “Martyn”, “Sonia” and “Michael”. User A 112 of node 102 would like to determine whether any of his contacts are also contacts of User B 114 of node 104, but User A does not want to disclose his list of contacts to User B at node 104. In order to do this, the method shown in FIG. 4 is used.

In step S402, node 102 generates hash values of the contact usernames stored in store 108 according to a hashing scheme as described above. In step S404, the hash values of User A's contacts are transmitted over the network on link 106 to node 104. As described above, one hash value in the hashing scheme corresponds to more than one contact username (i.e. the ratio of hash values to contact usernames in the system is 1:many). Therefore, node 104 cannot reliably determine the contacts of User A 112 using the hash values provided from node 102.

In step S406, node 104 generates hash values of the contact usernames stored in store 110 according to the same hashing scheme as that used to generate the hash values of the contact usernames of User A 112.

In step S408, the hash values received from node 102 are compared with the hash values generated at node 104 to find matching hash values. In step S410, the matching hash values are transmitted over the network to node 102 via link 106. In step S412, node 102 determines the contact usernames that were used to generate the matching hash values, in order to determine common contacts between User A 112 and User B 114.
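Steps S402 to S412 can be sketched as follows. This is an illustrative outline only: it assumes Python, uses binascii.crc_hqx as an assumed 16-bit CRC, and models transmission over link 106 as passing values between functions. The function names are hypothetical and do not appear in the figures.

```python
import binascii

def hash_username(username: str) -> int:
    """Assumed hashing scheme: 16-bit CRC-CCITT of the username."""
    return binascii.crc_hqx(username.encode("utf-8"), 0)

def node_a_prepare(contacts):
    """S402: node 102 hashes its contacts; only hash values leave the node."""
    return {hash_username(c) for c in contacts}

def node_b_match(received_hashes, contacts):
    """S406 and S408: node 104 hashes its own contacts and finds matches."""
    own_hashes = {hash_username(c) for c in contacts}
    return received_hashes & own_hashes

def node_a_resolve(matching_hashes, contacts):
    """S412: node 102 maps matching hash values back to its own contacts."""
    return {c for c in contacts if hash_username(c) in matching_hashes}

contacts_a = ["Amy", "Rosie", "Martyn"]               # store 108
contacts_b = ["Madis", "Martyn", "Sonia", "Michael"]  # store 110

sent = node_a_prepare(contacts_a)             # S402, then S404 (transmit)
matches = node_b_match(sent, contacts_b)      # S406, S408, then S410 (return)
common = node_a_resolve(matches, contacts_a)  # S412
```

In this sketch node 104 only ever sees 16-bit values, and node 102 only learns which of its own contacts produced a matching value. The result always includes “Martyn”; any other entry would be the consequence of a chance hash collision between different usernames.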

In the example shown in FIG. 3, the user in the system with username “Martyn” is a contact of both User A 112 and User B 114. The hash value generated for username “Martyn” will be found as a matching hash value in step S408, and then in step S412 it will be determined that Martyn is a common contact of User A 112 and User B 114. The hashing scheme is chosen to have significantly fewer unique hash values than there are users in the system, such that one hash value may be generated from many usernames. However, there are significantly more unique hash values in the hashing scheme than there are contact usernames in the stores 108 and 110. As an example, in the whole communication system there may be 300 million users with different usernames. A suitable hashing scheme is a 16-bit CRC which has a total key space of 65,536, i.e. there are 65,536 unique hash values which can be used to indicate the usernames. In this example each unique hash value would indicate an average of approximately 4500 usernames in the entire P2P system, but User A has three contacts and User B has four contacts, so it is unlikely that more than one contact of User A or User B will be indicated by the same hash value. In the example shown in FIG. 3, the probability of a random hash collision is given by: (no. of User A's contacts) × (no. of User B's contacts) / (no. of unique hash values in hashing scheme) = 12/65536 ≈ 0.00018. This is so low that in this case, any matching hash values can be taken to be indicative of matching contacts.
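The estimate above can be checked numerically. The following sketch, in Python, computes the estimate as stated in the text and, for comparison, a slightly tighter value obtained by treating each of the twelve cross pairs as an independent trial with uniformly distributed hash values. That independence and uniformity are simplifying assumptions made for this illustration, not statements from the description, and the function names are hypothetical.

```python
def collision_estimate(n_a: int, n_b: int, num_hash_values: int) -> float:
    """Estimate from the text: (contacts of A x contacts of B) / hash values."""
    return (n_a * n_b) / num_hash_values

def collision_independent_pairs(n_a: int, n_b: int,
                                num_hash_values: int) -> float:
    """Probability that at least one cross pair collides, assuming each pair
    collides independently with probability 1 / num_hash_values."""
    pairs = n_a * n_b
    return 1.0 - (1.0 - 1.0 / num_hash_values) ** pairs

# FIG. 3 example: three contacts, four contacts, 16-bit hash values.
estimate = collision_estimate(3, 4, 65536)
refined = collision_independent_pairs(3, 4, 65536)
```

Both values are approximately 0.00018, so the simple product formula is an adequate upper bound when the contact lists are small compared with the number of unique hash values.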

The reason that matching hash values can be assumed to be indicative of matching contacts, despite each hash value representing approximately 4500 contact usernames on average in this case, is that it is more likely that the Users 112 and 114 have common contacts than it is that two different contacts of user 112 or user 114 randomly generate the same hash value. This is true because there is an association between the contacts of User A 112 and the contacts of User B 114. In particular, User A 112 knows User B (User B 114 might be a contact of User A 112, or User A 112 might simply be aware of User B 114), which means that a contact of User B 114 is much more likely than a completely random user in the network 111 to be a contact of User A. In this way, by choosing a suitable hashing scheme, it can be ensured that when comparing contacts of users who know each other, matching hash values predominantly indicate matching contacts rather than hash collisions of different contacts. This allows matching hash values to be assumed to be indicative of matching contacts. One particularly useful scenario in which the method can be used is for allowing a Friends of Friends search to be executed on P2P nodes without revealing the contact names between P2P nodes that are participating in the search, while still allowing the identification of shared contacts.

This method is not limited to use in determining common contacts between users who know each other. Indeed the method will work for any type of information item stored in lists on different nodes in the network, where there is an association between the lists which increases the likelihood that information items in one list will be present in the other list. Where the lists are associated in such a way that the probability of one of the information items in one list matching one of the information items in the other list is greater than the probability of two different information items being identified by the same hash value in the hashing scheme, then matching hash values can be taken to be indicative of matching information items. In this way, common information items between the stores 108 and 110 can be found without disclosing the information items stored in store 108 or in store 110 to any node in the network. A node receiving the hash values of the information items in store 108 from node 102 cannot determine the information items in store 108 because each hash value identifies many different information items in the hashing scheme.

In an alternative embodiment, hash values of the information items in store 108 may be transmitted to a third node (not shown in FIG. 3), and hash values of the information items in store 110 may also be transmitted to the third node, wherein the hash values are compared at the third node to find matching hash values between the stores 108 and 110. In this embodiment, the hash values are not sent to the node 104, which may be beneficial if User A 112 has more trust in the third node than in node 104 and/or in User B 114.

A second particularly useful embodiment is described with reference to FIGS. 5 and 6 and relates to file sharing in a network. FIG. 5 shows a first sharing node 502 storing File 1, File 2 and File 3 in a store 508 and a second sharing node 504 storing File 11, File 12 and File 13 in a store 510. The sharing nodes 502 and 504 can communicate with a central index 512 over the network via respective links 506 and 507. Similarly to link 106 described above, links 506 and 507 may be direct or indirect links over the network.

In step S602 hash values of the files stored at sharing node 502 are generated at node 502 using a suitable hashing scheme as described above to prevent the hash values being used as identifiers of unique files, and hash values of the files stored at sharing node 504 are generated at node 504 using the same hashing scheme. In step S604 the hash values generated at node 502 are transmitted over link 506 to the central index 512 and the hash values generated at node 504 are transmitted over link 507 to the central index 512. The hash values are stored in store 514 at the central index 512. There may be many sharing nodes in the network, but only two (nodes 502 and 504) are shown in FIG. 5 for clarity. The store 514 stores the hash values in such a way that it can associate a hash value in the store 514 with the one of the sharing nodes (e.g. 502 or 504) from which the hash value was received. This may be done by any suitable method such as by storing the hash values from different sharing nodes in different lists, or by linking each hash value in the store 514 to the relevant sharing node.

A requesting node 102, such as the node 102 shown in FIG. 1 associated with User A 112, can communicate with the central index 512 over the network via a link 505. Similarly to link 106 described above, link 505 may be a direct or an indirect link over the network. Node 102 has a store 108 which can store one or more hash values of files that the node 102 would like to find in the network. For example as shown in FIG. 5, the user 112 wants to find File 2 and the store 108 stores “Hash 2” which corresponds to File 2 stored at sharing node 502. In step S606 the requesting node transmits a request to the central index using link 505 to locate the file which corresponds to the hash value stored in the store 108. The request includes the hash value “Hash 2” from the store 108.

In step S608 the hash value received from the requesting node 102 is compared with the hash values stored in store 514 at the central index 512 to determine any matching hash values. The location(s) of sharing nodes which store files identified by the matching hash values is (are) determined. Since a hash value corresponds to many files in the hashing scheme, it is possible that a matching hash value does not indicate the correct file that node 102 is searching for. However, the location(s) of the sharing nodes determined in step S608 are returned to the requesting node 102 in step S610. For example, in the system shown in FIG. 5, the location of the first sharing node 502 will be returned to the requesting node because Hash 2 corresponds to File 2 stored at node 502. However, it is also possible that, for example, Hash 11 generated from File 11 is the same as Hash 2 generated from File 2, so the location of the second sharing node 504 may also be returned to the requesting node 102.
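Steps S602 to S610 can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the embodiment: the names `short_hash` and `CentralIndex`, the node labels, and the choice of truncating SHA-256 to 16 bits are all hypothetical. The description requires only that each hash value corresponds to many files and that the index can associate each stored hash value with the sharing node it came from.

```python
import hashlib
from collections import defaultdict

HASH_BITS = 16  # assumed width; the description only requires "many files per hash value"


def short_hash(data, bits=HASH_BITS):
    """Hypothetical hashing scheme: keep only the top `bits` bits of SHA-256.

    Truncation is an assumption made for illustration; it makes many
    different inputs share each hash value.
    """
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:8], "big") >> (64 - bits)


class CentralIndex:
    """Stores hash values and remembers which sharing node sent each one."""

    def __init__(self):
        # hash value -> set of sharing-node identifiers (the "linking" option)
        self._nodes_by_hash = defaultdict(set)

    def register(self, node_id, hash_values):
        """Step S604: record the hash values received from a sharing node."""
        for value in hash_values:
            self._nodes_by_hash[value].add(node_id)

    def lookup(self, value):
        """Steps S608 to S610: return every node that sent a matching hash value.

        The result may include nodes whose file merely collides with the
        requested hash value, so it is a list of candidates only.
        """
        return set(self._nodes_by_hash.get(value, ()))


# Illustrative usage with made-up file contents.
index = CentralIndex()
index.register("node-502", [short_hash(b"File 1"), short_hash(b"File 2")])
index.register("node-504", [short_hash(b"File 11")])
candidates = index.lookup(short_hash(b"File 2"))
```

Keying the store by hash value makes the lookup in step S608 a single dictionary access; storing a separate list per sharing node, the other option mentioned above, would work equally well for small networks.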

In step S612 the requesting node 102 contacts the identified sharing node(s) over the network to determine whether the correct file (File 2) is stored at the sharing node(s). This may be done by transmitting an identifier of the file to the sharing node that is more precise than the hash values used previously. The more precise identifier may be a different hash value calculated using a different hashing scheme. In the different hashing scheme, a hash value may identify a unique file. Alternatively, the more precise identifier may be the filename of the file. Other identifiers may be used as would be apparent to the skilled person. In step S614, the sharing node(s) use the more precise identifier to determine whether the sharing node holds the file that the requesting node is searching for. In the example shown in FIG. 5, the first sharing node 502 will determine that File 2 is the file that the requesting node 102 is searching for, and may then transmit the file to the requesting node 102. However, the second sharing node 504 will determine that File 11 is not the file that the requesting node is searching for even though Hash 11 happened to match Hash 2. Even if the file at the sharing node is the file that the requesting node is searching for, the sharing node may decide not to transmit the file to the requesting node.
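The confirmation in steps S612 and S614 might look like the sketch below. It assumes, purely for illustration, that the more precise identifier is a full SHA-256 digest of the file contents; the text equally allows a filename or another identifier. `SharingNode`, `precise_id` and `confirm_holders` are hypothetical names.

```python
import hashlib


def precise_id(data):
    """Assumed 'more precise identifier': the full SHA-256 digest of the contents."""
    return hashlib.sha256(data).hexdigest()


class SharingNode:
    """A sharing node holding files as a mapping of filename -> contents."""

    def __init__(self, name, files):
        self.name = name
        self.files = files

    def has_file(self, identifier):
        """Step S614: check the precise identifier against every stored file."""
        return any(precise_id(contents) == identifier
                   for contents in self.files.values())


def confirm_holders(candidate_nodes, identifier):
    """Step S612: ask each candidate node whether it really holds the file."""
    return [node.name for node in candidate_nodes if node.has_file(identifier)]


# Illustrative usage: both nodes were returned by the index as candidates,
# but only one of them actually holds the wanted file.
node_502 = SharingNode("node-502", {"File 2": b"contents of file 2"})
node_504 = SharingNode("node-504", {"File 11": b"contents of file 11"})
wanted = precise_id(b"contents of file 2")
holders = confirm_holders([node_502, node_504], wanted)
```

Note that the precise identifier is sent only to the candidate sharing nodes, not to the central index, so the index still never sees anything that identifies a unique file.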

In some embodiments, in step S612, the requesting node will be required to send an authentication to the sharing node. The sharing node will then check the authentication and only if the requesting node is authenticated will the sharing node transmit the file to the requesting node in step S614. In a P2P system as described above, the authentication could be the digital certificate of the requesting node 102.

When the method is used in a system for file sharing such as that described above in relation to FIGS. 5 and 6, the central index can store hash values of files stored at sharing nodes, and because the hash values do not identify unique files, the actual files stored at the sharing nodes cannot be determined from an inspection of the central index. Furthermore, the requesting node 102 can send a request to the central index 512, and the actual information being searched for is not disclosed. However, it is still possible to use the central index to search for files stored on sharing nodes in the network using the method described above.

The method of the second particularly useful embodiment is described above with reference to FIGS. 5 and 6 in relation to information items being identifiers of files stored at the nodes. However, the method can be used for other information items where it is desired to identify sharing nodes which are storing particular information items.

A third particularly useful embodiment is described with reference to FIGS. 7 and 8 and relates to finding users with similar interests in a network. FIG. 7 shows a first user node 102 which is associated with User A 112. The store 108 on node 102 holds identifiers of interests of User A 112. Similarly, node 702 is associated with User C 710 and comprises a store 708 which holds identifiers of interests of User C 710. A central node 712 in the network can communicate with node 102 via link 705 and can communicate with node 702 via link 706. Similarly to link 106 described above, links 705 and 706 may be direct or indirect links over the network.

As examples, URLs and search keywords can be used as identifiers of a user's interests. Images viewed by a user from a database of images can also be used as identifiers of a user's interests.

In step S802 node 102 generates a set of hash values from the identifiers of the interests in store 108 using a suitable hashing scheme as described above. Similarly, node 702 generates a set of hash values from the identifiers of the interests in store 708 using the same hashing scheme. In step S804 the set of hash values generated at node 102 is transmitted to the central node 712 using the link 705 and the set of hash values generated at node 702 is transmitted to the central node 712 using the link 706. The sets of hash values are stored at the central node 712. In the example shown in FIG. 7, the set of hash values received from node 102 is stored in a store 714 and the set of hash values received from node 702 is stored in a store 716. In other embodiments, sets of hash values received from different nodes are stored in the same store at the central node 712 with the central node 712 having some mechanism for identifying the node from which the sets of hash values were received.

In step S806, the set of hash values received from node 102 is compared with the set of hash values received from node 702 at the central node 712 to find matching hash values between the sets. In step S808 the number of matching hash values between the sets is counted. The number of matching hash values gives an indication as to whether the users 112 and 710 have similar interests. Since in the hashing scheme each hash value identifies many different interests, some matching hash values may not relate to matching interests. Indeed, if the interests of the users 112 and 710 were selected completely at random from all of the interests that may be identified in the system, there would be an expected number of matching hash values between the two sets that would depend upon the size of the two sets of hash values and upon the number of unique hash values in the hashing scheme used to generate the hash values.

In step S810 it is determined whether the number of matching hash values is greater than the expected number of matching hash values between the two sets of interests if the interests were randomly selected. If there are more matching hash values than expected then that indicates that the users 112 and 710 have similar interests. The level of similarity between the interests of the two users can be quantified by the amount by which the number of matching hash values exceeds the expected number of matching hash values. Indeed, it is possible to attribute a strength value to the similarity of the interests between the users. This information can be used for many purposes. For example, if the central node 712 is used to host a social networking group, the strength of the similarity between two users who are part of the social networking group could be used for one of the users to identify the other user as someone who has similar interests to his own.
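The comparison in steps S806 to S810 can be illustrated with a short sketch. The text does not give a formula for the expected number of matching hash values; the sketch assumes hash values are uniformly distributed, in which case two independent sets of m and n distinct hash values drawn from H possible values share m·n/H values on average. The function names, and the use of the simple difference as the strength value, are hypothetical choices.

```python
def expected_random_matches(n_first, n_second, n_hash_values):
    """Expected overlap of two random sets of distinct hash values.

    Assumes uniformly distributed hash values (an assumption, not stated
    in the text): each of the n_first values appears in the other set
    with probability n_second / n_hash_values.
    """
    return n_first * n_second / n_hash_values


def similarity_strength(first_hashes, second_hashes, n_hash_values):
    """Steps S806 to S810: count matches and subtract the chance level.

    A positive result means more matches than random selection would
    explain; its size can serve as the strength value.
    """
    first, second = set(first_hashes), set(second_hashes)
    matches = len(first & second)                      # steps S806 and S808
    expected = expected_random_matches(len(first), len(second), n_hash_values)
    return matches - expected                          # step S810


# Illustrative usage: three hash values per user, two in common,
# with a 16-bit hashing scheme (65536 possible hash values).
strength = similarity_strength({11, 22, 33}, {22, 33, 44}, 2 ** 16)
users_similar = strength > 0
```

Other strength measures, such as the ratio of observed to expected matches, would fit the description just as well; the difference is used here only because it follows the wording "the amount by which the number of matching hash values exceeds the expected number".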

Because the hashing scheme uses hash values that do not identify unique interests, the method allows information that relates to interests of people to be used to locate people with common interests without revealing exactly what the interests are. The hash values stored on the central node 712 cannot be used to uniquely identify the interests of a user. The websites that a person visits can be very indicative of interests that the person has. This means that URLs used by a user can be used as the identifiers of the interests of the user. Similarly, search keywords can be used as the identifiers of the interests. The number of hash values in common between two lists can be considered indicative of common interests.

The method can also be applied to files. For example, by generating hash values of image file contents of an image sharing site and comparing lists of hash values between two users, it can be determined whether they like viewing similar images. As an example, if an average user has viewed 2000 images from a database of 2 million images then using a 16-bit hash value (with 65536 unique hash values) would sufficiently well identify images in two lists of images viewed by users while each hash value would match any of about 30 different images. Each matching hash value increases the likelihood that other matching hash values do indeed refer to matching images out of the 30 possibilities in the system. Once a sufficient number of matching hash values are found then there is a high probability that the two users have looked at the same images. In other words, if there is a large number of matching hash values, then the two users are identified as having similar interests, and once they are identified as having similar interests then there is an association between the set of interests of one user and the set of interests of the other user. This can then be thought of in a similar way as in the first particularly useful embodiment described above in which the contact lists of users who know each other are associated. In the same way, where two users have similar interests, then the interests are associated such that it is more likely that another matching hash value is due to a matching interest rather than being due to a random hash collision of different interests. Therefore, once the users 112 and 710 have been identified as having similar interests, further matching hash values can be assumed to identify matching interests.
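The figures in this example can be checked with a few lines of arithmetic. The chance-level estimate below is not stated in the text; it assumes uniformly distributed hash values and two users whose viewed images are unrelated, and the variable names are illustrative only.

```python
HASH_BITS = 16
n_hash_values = 2 ** HASH_BITS           # 65536 unique hash values
images_in_database = 2_000_000
images_per_user = 2_000

# Average number of database images sharing one hash value (roughly 30.5),
# which is why a single hash value cannot identify a unique image.
images_per_hash = images_in_database / n_hash_values

# Assumed chance level: matching hash values expected between two unrelated
# users with 2000 hash values each (roughly 61 under the uniformity assumption).
expected_chance_matches = images_per_user * images_per_user / n_hash_values

# Fraction of the hash space that one user's list occupies (roughly 3%).
fraction_of_hash_space = images_per_user / n_hash_values
```

Under these assumptions, a count of matching hash values well above the chance level is what would indicate that the two users have viewed the same images.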

In the example shown in FIG. 7, users 112 and 710 both have interests 2 and 3 and the matching hash values will be determined in the central node 712 in step S806. Having two matching hash values will be more than expected for the case where each user has only 3 interests as shown in FIG. 7 and so it will be determined that users 112 and 710 have similar interests.

The method of the third particularly useful embodiment is described above with reference to FIGS. 7 and 8 in relation to information items being identifiers of interests of users. However, the method can be used for other information items where it is desired to identify nodes which have similar information items.

The embodiments described above use a hashing scheme with fewer unique hash values than there are unique information items in the system such that one hash value cannot be used to identify a unique information item. This ensures privacy and security in the system since a node can transmit the hash values generated from information items stored on the node to another node and the other node cannot determine the information items using the hash values. Hence, the privacy and security are improved whilst the method maintains the ability to identify the presence of matching information items on different nodes in the system.

While this invention has been particularly shown and described with reference to preferred embodiments, it will be understood by those skilled in the art that various changes in form and detail may be made without departing from the scope of the invention as defined by the appendant claims.

1. A method of identifying the presence of matching information items in a network, the method comprising: using a hashing scheme to generate a set of first hash values from a respective set of first information items stored at a first node; transmitting the set of first hash values from the first node over the network to a second node effective to cause the second node to: compare the set of first hash values at the second node with a set of second hash values generated, using the hashing scheme, from a respective set of second information items stored in the network, to thereby determine at least one matching hash value between the set of first hash values and the set of second hash values; and use the determined at least one matching hash value to identify the presence of at least one matching information item between the set of first information items and the set of second information items; and wherein the hashing scheme is chosen so that each hash value is unique with respect to the other hash values and any particular hash value in the hashing scheme indicates a sufficient number of information items to prevent the particular hash value being used as an identifier of a unique information item, such that the transmission of the set of first hash values to the second node does not disclose the set of first information items to the second node.
2. The method of claim 1 wherein the set of second hash values comprises a plurality of second hash values and the set of second information items comprises a respective plurality of second information items.
3. The method of claim 1 wherein the set of first hash values comprises a plurality of first hash values and the set of first information items comprises a respective plurality of first information items.
4. The method of claim 1 wherein the set of second information items is stored on the second node.
5. The method of claim 1 wherein there is an association between the set of first information items and the set of second information items such that the probability that one of the first information items matches one of the second information items is greater than the probability that two different information items are identified by the same hash value in the hashing scheme, such that matching hash values are assumed to be indicative of matching information items.
6. The method of claim 5 wherein the first information items identify contacts of a first user of the first node, and the second information items identify contacts of a second user in the network.
7. The method of claim 1 wherein the transmitting is further effective to cause the second node to: count the number of matching hash values between the set of first hash values and the set of second hash values; and determine whether the counted number of matching hash values is greater than the expected number of matching hash values based on the number of first hash values, the number of second hash values and the hashing scheme; and if the counted number is greater than the expected number, identify that the set of first information items is similar to the set of second information items.
8. The method of claim 7, wherein the transmitting is further effective to cause the second node to use the difference between the counted number and the expected number to attribute a strength value to the similarity between the set of first information items and the set of second information items.
9. The method of claim 7 wherein the first information items identify interests of a first user of the first node, and the second information items identify interests of a second user in the network, and wherein the identification that the set of first information items is similar to the set of second information items identifies that the first and second users have similar interests.
10. The method of claim 9 wherein the first and second information items are one of Uniform Resource Locators, search keywords or images.
11. The method of claim 1, wherein the set of second information items is stored on a third node in the network, and wherein transmitting the set of first hash values from the first node over the network to the second node effective to cause the second node to use the determined at least one matching hash value to identify the presence of at least one matching information item comprises: receiving the location of the third node over the network at the first node; determining the one of the first information items that corresponds to one of the at least one matching hash value; transmitting an identifier of the one of the first information items from the first node to the third node effective to cause the third node to: selectively determine the one of the second information items that corresponds to the one of the at least one matching hash value; and selectively determine whether the identifier of the one of the first information items also identifies the one of the second information items.
12. The method of claim 11, further comprising transmitting, from the first node, an authentication request to the third node and wherein the transmitting the identifier of the one of the first information items from the first node to the third node is effective to cause the third node to selectively determine the one of the second information items and selectively determine whether the identifier identifies the one of the second information items if the authentication request is accepted by the third node.
13. The method of claim 11 wherein the one of the first information items is more precisely identified by the identifier than by the one of the at least one matching hash value.
14. The method of claim 13 wherein the identifier uniquely identifies the one of the first information items.
15. The method of claim 11 wherein the first and second information items identify files and the method further comprises, if it is determined that the identifier identifies the one of the second information items, receiving the file identified by the one of the second information items from the third node at the first node.
16. The method of claim 1 wherein the total number of unique hash values in the hashing scheme is at least an order of magnitude less than the total number of information items stored in the network.
17. The method of claim 1 wherein the total number of unique hash values in the hashing scheme is at least an order of magnitude greater than the total number of information items in the list of first information items and in the list of second information items.
18. A network comprising: a first node comprising: means for receiving a set of first hash values over a network from a second node, the hash values generated by the second node from a respective set of first information items stored at the second node using a hashing scheme; means for comparing the set of first hash values with a set of second hash values generated, using the hashing scheme, from a respective set of second information items stored in the network, to thereby determine at least one matching hash value between the set of first hash values and the set of second hash values, wherein the determined at least one matching hash value is used to identify the presence of at least one matching information item between the set of first information items and the set of second information items, and wherein the hashing scheme is chosen so that each hash value is unique with respect to the other hash values and any particular hash value in the hashing scheme indicates a sufficient number of information items to prevent the particular hash value being used as an identifier of a unique information item, such that the transmission of the set of first hash values to the first node does not disclose the set of first information items to the first node.
19. A method comprising: using a hashing scheme to generate a set of hash values from a respective set of information items stored at a first node, each of the hash values being unique with respect to the other hash values and any particular hash value in the hashing scheme indicating a sufficient number of information items to prevent the particular hash value being used as an identifier of a unique information item; and transmitting the hash values from the first node over a network to a second node such that the transmission of the set of hash values to the second node does not disclose the set of information items to the second node.
20. The method of claim 19 wherein the total number of unique hash values in the hashing scheme is at least an order of magnitude less than a total number of information items stored in the network.